
fix(keyshare): align halt failure propagation and dkg timeout budget #1530

Open
hmzakhalid wants to merge 1 commit into main from fix/e3-halt-hlt-003

Conversation

@hmzakhalid
Member

@hmzakhalid hmzakhalid commented May 6, 2026

closes #1531

Summary by CodeRabbit

  • New Features

    • Implemented flexible, window-based timeout management for distributed key generation operations with per-component configuration support via environment variables.
  • Documentation

    • Updated distributed key generation documentation with new timeout behavior, configuration options, and failure scenarios.
  • Bug Fixes

    • Enhanced telemetry preservation and error propagation during distributed key generation timeout events.

@vercel

vercel Bot commented May 6, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| crisp | Ready | Preview, Comment | May 6, 2026 9:58am |
| enclave-docs | Ready | Preview, Comment | May 6, 2026 9:58am |


@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

📝 Walkthrough

This PR introduces derived, budget-based DKG timeout handling by replacing fixed per-stage timeouts with a DKG-window scheme (default 7200s) and per-phase cutoff percentages. A new timeout policy module computes timeouts for encryption key, threshold share, and decryption key collection phases, passed as explicit parameters to each collector during initialization.

Changes

DKG Timeout Policy Integration

| Layer / File(s) | Summary |
|---|---|
| Core Timeout Policy<br>crates/keyshare/src/timeout_policy.rs | New module defining DkgTimeoutPhase enum with three phases, cutoff basis point constants (EncryptionKey: 10%, ThresholdShare: 60%, DecryptionKeyShared: 30%), and utilities to resolve computed timeouts from DKG start time and window budget; includes comprehensive unit tests for timeout calculations. |
| State & Tracking<br>crates/keyshare/src/threshold_keyshare.rs (lines 62, 265–267, 299–300) | Add dkg_started_at_unix_secs: Option<u64> field to ThresholdKeyshareState, initialized at state creation to track when DKG began. |
| Collector Interface Updates<br>crates/keyshare/src/encryption_key_collector.rs, crates/keyshare/src/threshold_share_collector.rs, crates/keyshare/src/decryption_key_shared_collector.rs | Each collector now accepts explicit timeout: Duration parameter in setup() and stores it as a private field; replaces environment-based timeout computation with per-instance timeout scheduling. |
| Timeout Orchestration<br>crates/keyshare/src/threshold_keyshare.rs (lines 470–482, 500–513, 537–549) | ThresholdKeyshare computes per-phase timeouts via resolve_timeout() using DKG start time and phase, then passes the derived duration to each collector during initialization. |
| Failure Handling & Tests<br>crates/keyshare/src/threshold_keyshare.rs (lines 2524–2531, 2558–2564) | Update collection failure paths to publish events and emit E3Failed without context, consistent with new timeout regime; add unit test module covering encryption key, threshold share, and decryption key collection failure scenarios. |
| Module Declaration<br>crates/keyshare/src/lib.rs | Add mod timeout_policy to expose the new timeout module internally. |
| Flow Documentation<br>agent/flow-trace/04_DKG_AND_COMPUTATION.md | Update Step 3, Step 6, and Step 6a to document derived timeout behavior, including DKG-window-based cutoff application, event publication, and actor stop semantics on timeout. |
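
To make the percentages in the summary concrete, here is a minimal sketch of turning basis points into a slice of the DKG window. The names `Phase`, `cutoff_bps`, and `phase_budget_secs` are hypothetical and are not taken from `timeout_policy.rs`. The summary does not say whether the real module treats each percentage as an independent slice or as a cutoff measured from DKG start, so this only shows the slice arithmetic.

```rust
/// Basis points in 100% (1 bp = 0.01%).
const BPS_DENOMINATOR: u64 = 10_000;

/// Hypothetical enum mirroring the three phases named in the summary.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Phase {
    EncryptionKey,
    ThresholdShare,
    DecryptionKeyShared,
}

impl Phase {
    /// Share of the DKG window in basis points (10% / 60% / 30% per the summary).
    pub const fn cutoff_bps(self) -> u64 {
        match self {
            Phase::EncryptionKey => 1_000,
            Phase::ThresholdShare => 6_000,
            Phase::DecryptionKeyShared => 3_000,
        }
    }
}

/// Slice of the window allotted to `phase`, in whole seconds (rounded down).
pub fn phase_budget_secs(window_secs: u64, phase: Phase) -> u64 {
    // Widen to u128 so a very large window cannot overflow the multiplication.
    let scaled = window_secs as u128 * phase.cutoff_bps() as u128;
    (scaled / BPS_DENOMINATOR as u128) as u64
}
```

With the default 7200s window, these shares work out to 720s, 4320s, and 2160s, which together cover the whole window.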

Sequence Diagram

sequenceDiagram
    participant TK as ThresholdKeyshare
    participant TP as timeout_policy
    participant EKC as EncryptionKeyCollector
    participant TSC as ThresholdShareCollector

    TK->>TK: Initialize state with dkg_started_at_unix_secs
    TK->>TP: resolve_timeout(EncryptionKeyCollection, dkg_start_time)
    TP-->>TK: DerivedTimeout { duration, description }
    TK->>EKC: setup(..., timeout_duration)
    EKC->>EKC: Store timeout, schedule expiration
    
    TK->>TP: resolve_timeout(ThresholdShareCollection, dkg_start_time)
    TP-->>TK: DerivedTimeout { duration, description }
    TK->>TSC: setup(..., timeout_duration)
    TSC->>TSC: Store timeout, schedule expiration
    
    Note over EKC,TSC: Collection phases execute<br/>with derived timeouts
    EKC->>EKC: timeout fires
    EKC->>TK: EncryptionKeyCollectionFailed
    TK->>TK: Emit E3Failed, republish telemetry, stop

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly Related PRs

  • gnosisguild/enclave#1377: Modifies the same keyshare collectors and threshold_keyshare orchestration; this PR introduces derived per-phase timeouts while the related PR adjusts timeout constants and adds C6/C7 proof flows.

Suggested Reviewers

  • 0xjei
  • cedoor

Poem

🐰 A rabbit hops through timeout clocks,
From fixed to flowing DKG rocks,
Each phase now budgets from one grand plan,
Windows and fractions, calculated fair and tan! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 24.24%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the two main changes: DKG timeout handling refactoring and failure propagation alignment in the keyshare module. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/keyshare/src/threshold_keyshare.rs (1)

266-299: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Start the DKG timeout clock on CiphernodeSelected, not in ThresholdKeyshareState::new.

resolve_timeout(...) subtracts the time elapsed since dkg_started_at_unix_secs, but Line 298 seeds that value when the actor/state is constructed. If the actor exists for any meaningful time before DKG actually begins, the collectors inherit a reduced or even exhausted budget and can time out immediately on startup. Initialize this field to None here and stamp it when handle_ciphernode_selected begins the DKG flow, before the collectors are created.

💡 Suggested direction
 pub struct ThresholdKeyshareState {
     ...
     #[serde(default)]
     pub dkg_started_at_unix_secs: Option<u64>,
     ...
 }

 impl ThresholdKeyshareState {
     pub fn new(
         e3_id: E3id,
         party_id: PartyId,
         state: KeyshareState,
         threshold_m: u64,
         threshold_n: u64,
         params: ArcBytes,
         address: String,
         proof_aggregation_enabled: bool,
     ) -> Self {
         Self {
             e3_id,
             address,
             party_id,
             state,
             threshold_m,
             threshold_n,
             params,
             aggregated_pk: None,
             expelled_parties: HashSet::new(),
             honest_parties: None,
-            dkg_started_at_unix_secs: Some(now_unix_secs()),
+            dkg_started_at_unix_secs: None,
             proof_aggregation_enabled,
         }
     }
 }

Then, in handle_ciphernode_selected, set dkg_started_at_unix_secs = Some(now_unix_secs()) before calling ensure_collector(...) / ensure_encryption_key_collector(...).
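
A minimal sketch of why the stamp location matters, assuming the remaining time is computed as budget minus elapsed. The helper `remaining_budget` is hypothetical; the real `resolve_timeout` is not shown in this review and may differ.

```rust
use std::time::Duration;

/// Time left in a phase budget, given when the DKG clock was started.
/// Hypothetical helper: saturates at zero instead of underflowing.
pub fn remaining_budget(
    started_at_unix_secs: u64,
    now_unix_secs: u64,
    budget_secs: u64,
) -> Duration {
    // A clock that appears to run backwards is treated as zero elapsed time.
    let elapsed = now_unix_secs.saturating_sub(started_at_unix_secs);
    Duration::from_secs(budget_secs.saturating_sub(elapsed))
}
```

Take an assumed 720s budget (10% of the default 7200s window). If the clock is stamped at construction (t = 1000) and CiphernodeSelected arrives at t = 2000, `remaining_budget(1_000, 2_000, 720)` is already zero, so the collector would expire as soon as it is created. Stamping at selection gives `remaining_budget(2_000, 2_000, 720)`, the full 720s.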

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/keyshare/src/threshold_keyshare.rs` around lines 266 - 299, The
dkg_started_at_unix_secs is being initialized in ThresholdKeyshareState::new
which causes resolve_timeout to undercount if the actor lives before DKG starts;
change the initialization in ThresholdKeyshareState::new to set
dkg_started_at_unix_secs to None, and then in handle_ciphernode_selected set
self.dkg_started_at_unix_secs = Some(now_unix_secs()) immediately before calling
ensure_collector(...) and ensure_encryption_key_collector(...), leaving
resolve_timeout to compute elapsed correctly.
🧹 Nitpick comments (1)
crates/keyshare/src/timeout_policy.rs (1)

129-134: ⚡ Quick win

Warn on invalid timeout env values instead of silently ignoring them.

Right now malformed or zero-valued overrides fall back to the default path with no signal. For a rollout-sensitive timeout policy, that makes misconfiguration very hard to spot in production. A small warning here would make bad overrides immediately visible.

💡 Suggested change
+use tracing::warn;
+
 fn parse_env_secs(name: &str) -> Option<u64> {
-    std::env::var(name)
-        .ok()
-        .and_then(|value| value.parse::<u64>().ok())
-        .filter(|secs| *secs > 0)
+    match std::env::var(name) {
+        Ok(value) => match value.parse::<u64>() {
+            Ok(secs) if secs > 0 => Some(secs),
+            _ => {
+                warn!(env = name, value = %value, "Ignoring invalid timeout override");
+                None
+            }
+        },
+        Err(_) => None,
+    }
 }
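
One possible variation on the suggestion above: classify the raw value in a pure function so the invalid-value branch can be unit-tested without mutating process environment variables. The `Override` enum and `classify_override` are hypothetical names, and `eprintln!` stands in for the project's logger (the suggestion above assumes `tracing::warn!`).

```rust
/// Outcome of interpreting one raw override value (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Override {
    /// The variable was not set (or was not valid Unicode).
    Unset,
    /// A positive number of seconds.
    Valid(u64),
    /// Present but unparsable or zero; carries the offending value.
    Invalid(String),
}

/// Pure classification step: no environment access, so it is easy to test.
pub fn classify_override(raw: Option<&str>) -> Override {
    match raw {
        None => Override::Unset,
        Some(value) => match value.parse::<u64>() {
            Ok(secs) if secs > 0 => Override::Valid(secs),
            _ => Override::Invalid(value.to_string()),
        },
    }
}

/// Same contract as the original helper: Some(secs) only for a valid override.
pub fn parse_env_secs(name: &str) -> Option<u64> {
    let raw = std::env::var(name).ok();
    match classify_override(raw.as_deref()) {
        Override::Valid(secs) => Some(secs),
        Override::Invalid(value) => {
            // Stand-in for a structured warning via the project's logger.
            eprintln!("ignoring invalid timeout override {name}={value:?}");
            None
        }
        Override::Unset => None,
    }
}
```

The fallback behavior is unchanged: invalid and unset values both yield `None`, and only the invalid case produces a warning.
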
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/keyshare/src/timeout_policy.rs` around lines 129 - 134, The
parse_env_secs function currently drops malformed or zero values silently;
update parse_env_secs to detect when the environment variable is present but
fails to parse or is <= 0 and emit a warning (using the project's logging
facility, e.g., tracing::warn! or log::warn!) that includes the variable name
and the invalid value, then continue returning None for the default behavior;
locate the function parse_env_secs and add the warning branch where
.ok().and_then(...).filter(...) would otherwise swallow the bad input.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/keyshare/src/threshold_keyshare.rs`:
- Around line 266-299: The dkg_started_at_unix_secs is being initialized in
ThresholdKeyshareState::new which causes resolve_timeout to undercount if the
actor lives before DKG starts; change the initialization in
ThresholdKeyshareState::new to set dkg_started_at_unix_secs to None, and then in
handle_ciphernode_selected set self.dkg_started_at_unix_secs =
Some(now_unix_secs()) immediately before calling ensure_collector(...) and
ensure_encryption_key_collector(...), leaving resolve_timeout to compute elapsed
correctly.

---

Nitpick comments:
In `@crates/keyshare/src/timeout_policy.rs`:
- Around line 129-134: The parse_env_secs function currently drops malformed or
zero values silently; update parse_env_secs to detect when the environment
variable is present but fails to parse or is <= 0 and emit a warning (using the
project's logging facility, e.g., tracing::warn! or log::warn!) that includes
the variable name and the invalid value, then continue returning None for the
default behavior; locate the function parse_env_secs and add the warning branch
where .ok().and_then(...).filter(...) would otherwise swallow the bad input.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d59d5bea-4e0b-47f1-9124-05a09e817373

📥 Commits

Reviewing files that changed from the base of the PR and between c7e9802 and b26f4e3.

📒 Files selected for processing (7)
  • agent/flow-trace/04_DKG_AND_COMPUTATION.md
  • crates/keyshare/src/decryption_key_shared_collector.rs
  • crates/keyshare/src/encryption_key_collector.rs
  • crates/keyshare/src/lib.rs
  • crates/keyshare/src/threshold_keyshare.rs
  • crates/keyshare/src/threshold_share_collector.rs
  • crates/keyshare/src/timeout_policy.rs


Development

Successfully merging this pull request may close these issues.

HLT-003: Early DKG collector failures stop locally without failing the round
